ABSTRACT

We present an algorithm for tracking moving objects using intrinsic
minimal surfaces. The algorithm handles severe and total occlusions
particularly well, even in the presence of weak object boundaries. We
adopt an edge-based approach and find the segmentation as a minimal
surface in 3D space-time, the metric being dictated by the image
gradient. Object boundaries are represented implicitly as the level set
of a higher-dimensional function, and no particular object model is
assumed. We also avoid explicit estimation of a dynamic model, since the
problem is regarded as one of static energy minimization. A set of
interior points provided by the user constrains the optimization, which
basically corresponds to selecting the object of interest within the
video sequence. The constraints restrict the resulting surface to be
star-shaped in the 3D spatio-temporal space. We present some challenging
examples that show the robustness of the technique.

1.  INTRODUCTION 

Despite numerous efforts by researchers, successfully tracking moving
objects in even surprisingly simple video sequences remains a
challenging problem. The goal is to track moving objects that possibly
change shape and are subject to occlusions, under stationary or moving
camera conditions. A basic principle that is repeatedly used is to start
with the segmentation of a single frame and then forward-track its
evolution across subsequent frames. Due to noise, occlusions, poor
motion models, etc., the tracker may lose the object at some point, and
re-initialization procedures have to be devised. Another class of
approaches uses information from both sides, that is, past and future
frames are both taken into account so that increased reliability and
robustness can be achieved.¹ In this latter direction, we present an
algorithm for tracking moving objects using intrinsic minimal surfaces
which integrates information across all frames in a given sequence. We
adopt an edge-based approach and find the segmentation as a minimal
surface in 3D space-time. Such techniques have been extensively used in
the literature for both 2D and 3D object segmentation. The idea is to
first design an energy functional that is minimized at the object of
interest. Thereafter, the problem becomes one of non-convex
optimization, usually solved by following a gradient descent flow that
gives a local minimizer of the energy. Initialization of the flow is a
key step: it allows recovery of the desired object provided the initial
guess is close to the correct minimum. The availability of robust
optimization techniques is therefore very important, not only because
they guarantee finding the correct minimum but also because they allow
us to concentrate on the design of the segmentation energy, which
ultimately determines the performance of the algorithm. One such
technique was introduced in [1] for the segmentation of 3D tomograms,
and is used and extended here to obtain the spatio-temporal minimal
surface. Segmentation is achieved in a semi-automatic fashion: points
inside the object are first specified by the user in a small number of
keyframes. A basic interpolated trajectory is then obtained that allows
us to compute a cylindrical transformed domain, which provides a very
convenient setting for carrying out the surface minimization. In a final
step we convert the segmentation result back to the original Cartesian
domain.

¹ Although these approaches cannot be applied to real-time applications
where future frames are not available, there is a large number of
off-line video applications where these techniques can perform
significantly better than forward-predictive techniques.

2. BACKGROUND ON TRACKING AND VIDEO SEGMENTATION

Recently, sophisticated dynamical models have been introduced to
forward-track the evolution of implicitly represented contours over time
in the presence of occlusions. In [2], significant occlusions have to be
handled explicitly, requiring detection mechanisms that are then
incorporated into the dynamic model of the system. The authors in [3]
also assume an explicit model for the shape and motion of contours. In
both these approaches the balance between inertial and image-related
terms has to be carefully set for the contour to appropriately track the
object of interest. Tracking of walking persons under severe occlusions
has also been addressed in [4], which requires a stationary camera and
enforcement of motion periodicity constraints. The radial contours
presented in [5] are closely related to our approach, as they use a
similar 2D image segmentation technique. However, that method only works
on a frame-by-frame basis, and a separate object model (and associated
dynamics) needs to be estimated in order to track the segmenting curve
across frames. Video segmentation with 3D surfaces in space-time has
also been addressed, see [6] for example. There, the authors take a
region-based approach that cannot deal with severe or total occlusions
because region properties are completely missing in the occluded frames.
The authors in [7] propose an interactive system for the segmentation of
video sequences. They represent contours parametrically and integrate
information from past and future keyframes. An interactive environment
is provided for the user to progressively refine the location of control
points across frames.


3.  ENERGY  BASED  SEGMENTATION 

Let $\Gamma$ be a contour embedded in $\mathbb{R}^3$ that represents the
boundary of interest. Its intrinsic area is

$$\int_{\Gamma} g(\Gamma)\, dA \qquad (1)$$

where $g : \mathbb{R}^3 \to (0, \infty]$ is the image-derived metric. By
minimizing the quantity in Equation (1), $\Gamma$ is encouraged to go
through areas of small cost (corresponding to boundaries), yielding the
desired segmentation. The metric typically has the form
$g = f(I) + w$, where $f$ depends on the 3D input image $I$, and $w$ is
a constant that controls the smoothness of the minimizing contour. For
the examples in this paper we use
$f(I) = \frac{1}{|\nabla I_\sigma| + \epsilon}$, where $\epsilon$ is a
small constant that prevents the denominator from vanishing and
$I_\sigma$ is a smoothed version of $I$. Note that the image $I$ here is
the three-dimensional cube in space-time obtained by stacking individual
2D frames across the time dimension; consequently, the smoothing and
gradient operations are computed intrinsically in 3D. Given an initial
contour $\Gamma_0$, the solution to the segmentation problem is given by
the steady state of the gradient descent flow. Although this technique
has proven to be very successful, it strongly depends on the choice of
the initial contour $\Gamma_0$, which has to be close to the desired
minimum. Only recently has the availability of global optimization
techniques made it possible to address fundamental but challenging
problems in image analysis, such as video segmentation under severe
occlusions. In particular, we extend the minimization technique
presented in [1]; see also [8, 9, 10] for related techniques.
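As a rough illustration of the metric described in this section, the following Python sketch evaluates $g = 1/(|\nabla I_\sigma| + \epsilon) + w$ on a space-time volume stored as nested lists indexed `[t][y][x]`. It assumes the volume has already been smoothed; the function name `edge_metric`, the finite-difference stencil, and the default values of `eps` and `w` are illustrative assumptions and are not taken from the paper.

```python
import math

def edge_metric(volume, eps=1e-3, w=0.0):
    """Return g[t][y][x] = 1 / (|grad I| + eps) + w for a pre-smoothed volume.

    volume[t][y][x] holds intensities of the stacked frames (assumed to be
    smoothed already). The gradient is taken jointly over t, y and x.
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])

    def diff(get, i, n):
        # Central difference in the interior, one-sided at the borders.
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        return (get(hi) - get(lo)) / (hi - lo) if hi > lo else 0.0

    g = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for t in range(T):
        for y in range(H):
            for x in range(W):
                dt = diff(lambda k: volume[k][y][x], t, T)
                dy = diff(lambda k: volume[t][k][x], y, H)
                dx = diff(lambda k: volume[t][y][k], x, W)
                grad = math.sqrt(dt * dt + dy * dy + dx * dx)
                # Small cost on strong edges, large cost in flat regions.
                g[t][y][x] = 1.0 / (grad + eps) + w
    return g
```

On a volume containing a vertical step edge, the returned cost is low along the edge and close to `1 / eps + w` in flat regions, which is what encourages the minimizing surface to follow object boundaries.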


4.  MINIMAL  SURFACES  FOR  VIDEO 

In video segmentation, the 3D surface in space-time generated by the
movement of an object usually has a tubular (“cylindrical”) shape. That
is, if we disregard occlusions for the moment, integration of object
boundaries across frames renders a surface with such a shape. In this
setting, let us express the cylinder’s axis analytically as a function
of time $P(t) : \mathbb{R} \to \mathbb{R}^2$, which gives the position
at frame $t$ of a point inside the cylinder. For each frame we then
consider a polar coordinate system centered at $P(t)$ and assume that
the lateral surface of the cylinder can be expressed as a single-valued
function $\rho(\theta)$, see Figure 1 (a). Extending to all frames, the
complete surface can then be expressed in the cylindrical system
$(\rho, \theta, t)$ as a function $\rho(\theta, t)$, see Figure 1 (b).
Although this restricts the segmentation technique to the class of
star-shaped surfaces, it provides a convenient and sufficiently general
setting for solving the optimization problem as described below.
Assuming that $P(t)$ is known, the segmentation problem is equivalent to
the search for a function $\rho(\theta, t)$ representing the surface
that minimizes the energy in Equation (1). As similarly discussed in
[1, 11], such a geometric construction introduces a scaling factor
$\frac{1}{\rho}$ on the metric that avoids the global minimum of zero
energy. That is, for uniform $g$, concentric cylinders
$\rho(\theta, t) = \text{constant}$ will all be minimizers of the
energy.
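To make the per-frame coordinate change concrete, here is a minimal Python sketch, assuming a frame stored as a nested list `frame[y][x]` and an axis point `center = (cx, cy)` inside the object. The function names `to_polar` and `contour_to_cartesian`, the nearest-neighbour sampling, and the border clamping are all illustrative assumptions rather than details given in the paper.

```python
import math

def to_polar(frame, center, n_rho, n_theta):
    """Resample frame[y][x] on a (rho, theta) grid centred at center=(cx, cy).

    Row r of the result holds samples at radius r (in pixels); column k
    holds samples at angle 2*pi*k/n_theta. Nearest-neighbour sampling.
    """
    h, w = len(frame), len(frame[0])
    cx, cy = center
    polar = [[0.0] * n_theta for _ in range(n_rho)]
    for r in range(n_rho):
        for k in range(n_theta):
            theta = 2.0 * math.pi * k / n_theta
            x = int(round(cx + r * math.cos(theta)))
            y = int(round(cy + r * math.sin(theta)))
            # Clamp samples that fall outside the frame to the border.
            x = min(max(x, 0), w - 1)
            y = min(max(y, 0), h - 1)
            polar[r][k] = frame[y][x]
    return polar

def contour_to_cartesian(rho, center):
    """Map a sampled radial function rho[k] back to (x, y) contour points."""
    cx, cy = center
    n_theta = len(rho)
    return [(cx + rho[k] * math.cos(2.0 * math.pi * k / n_theta),
             cy + rho[k] * math.sin(2.0 * math.pi * k / n_theta))
            for k in range(n_theta)]
```

In the transformed domain a star-shaped boundary becomes the graph of a single-valued function of $\theta$, so a disk centred at the axis point maps to a horizontal line. Stacking the polar images of all frames gives the $(\rho, \theta, t)$ volume in which the surface is sought.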

Assuming the position of points inside the object changes slowly from
frame to frame, the axis $P(t)$ can be specified as a sequence of
discrete points at a few selected keyframes, see Figure 2. From this set
of points, we linearly interpolate to get the coordinates of the
intermediate points. The number of points required depends on the
complexity of the movement: the simplest motions require a minimum of
two points (i.e., in the first and last frames), but more complex motion
paths may require additional intermediate points in order to guarantee
that the interpolated positions fall inside the object in all
intermediate frames.² Note that we do not require complete determination
of the object’s motion, but only an approximation that is good enough
for the correct object geometry to be recovered.
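The linear interpolation of the axis between keyframes can be sketched as follows. This is a minimal Python example under stated assumptions: frames are indexed from 0, the user-selected points are passed as a dictionary mapping frame index to `(x, y)`, and the function name `interpolate_axis` is hypothetical.

```python
def interpolate_axis(keypoints, n_frames):
    """Linearly interpolate the axis P(t) from user-selected keyframe points.

    keypoints: dict {frame_index: (x, y)}; must contain frame 0 and
    frame n_frames - 1. Returns a list with one (x, y) per frame.
    """
    frames = sorted(keypoints)
    if frames[0] != 0 or frames[-1] != n_frames - 1:
        raise ValueError("keypoints must include the first and last frames")
    axis = []
    for t0, t1 in zip(frames, frames[1:]):
        (x0, y0), (x1, y1) = keypoints[t0], keypoints[t1]
        for t in range(t0, t1):
            a = (t - t0) / (t1 - t0)  # fraction of the way from t0 to t1
            axis.append((x0 + a * (x1 - x0), y0 + a * (y1 - y0)))
    # The last keyframe is not covered by the half-open ranges above.
    axis.append(tuple(float(c) for c in keypoints[frames[-1]]))
    return axis
```

For instance, `interpolate_axis({0: (0, 0), 4: (8, 4)}, 5)` places the axis at `(4.0, 2.0)` in the middle frame. The sketch does not check that interpolated positions stay inside the object; that remains the user's responsibility when choosing keyframes.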

² Note that more sophisticated interpolation schemes can also be used,
for example, polynomial or spline interpolants. Even dynamic models for
the motion of single points can be incorporated so that fewer points
need to be specified. Tracking of a single axis point can easily be done
with traditional tracking techniques or as the computation of minimal
paths using an adequate image metric. That is, we can automatically
compute the axis as a geodesic curve (on a metric space dictated by
image features) between a pair of selected end points in the keyframes.

Fig. 1. (a) Polar coordinate transformation at each individual frame
$t$. (b) Cylindrical transformed domain in $\rho$, $\theta$, and $t$
coordinates. To guarantee the tubular shape in the Cartesian domain, the
transformed surface has to be periodic in the $\theta$ direction.

Fig. 2. Approximating a simple motion trajectory by linear
interpolation. Selected points must be inside the object of interest at
each frame, and interpolated positions must also fall inside the object
in all intermediate frames.

4.1. Finding the minimal surface

The optimization is done with the technique presented in [1]. The basic
idea is to solve the minimal surface problem by sectioning the 3D domain
with 2D planes and finding geodesics restricted to the cutting sections.
If we assume that each such 2D geodesic corresponds to the intersection
of the 3D minimal surface with the slicing plane, we can recover the
surface as the collection of curves. Planar geodesics between two points
are computed by a non-iterative procedure [12]. This procedure can
easily be modified to handle periodic geodesics, which are needed for
the computation of geodesics in the $\theta$ direction in order to
enforce the cylindrical shape. The volume is traversed in both
directions ($\theta$ and $t$, see Figure 1 (b)) in increasing order of
the intrinsic length of the corresponding geodesics. By first processing
slices with shorter intrinsic lengths we rely on areas where the metric
is strongly anisotropic. As we continue to process sections of
increasing cost, we gradually start enforcing the restriction that
geodesics in both directions should be in agreement with each other, as
they are part of a single surface. For further details on the
computation of the minimal surface we refer the reader to [1]. Once the
segmentation in the transformed domain is obtained, we bring the result
back to the Cartesian grid.
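The planar step can be illustrated on a single $(\rho, \theta)$ slice. The Python sketch below is a simplified stand-in and not the non-iterative procedure of [12]: it uses dynamic programming on a discrete cost grid, approximates the intrinsic length by the sum of the metric values visited, and enforces periodicity by requiring the path to end within `max_step` of the radius where it started. The function name `periodic_min_path` and the `max_step` parameter are hypothetical.

```python
def periodic_min_path(cost, max_step=1):
    """Cheapest closed path across theta on a (rho, theta) cost grid.

    cost[r][k] is the metric at radius index r and angle index k.
    Returns (total, path), where path[k] is the radius index chosen at
    angle k, consecutive radii differ by at most max_step, and the end of
    the path is within max_step of its start (periodicity in theta).
    """
    n_rho, n_theta = len(cost), len(cost[0])
    inf = float("inf")
    best = (inf, None)
    for start in range(n_rho):
        # dist[r]: cheapest cost of reaching radius r at the current angle
        # when the path is forced to begin at (start, 0).
        dist = [inf] * n_rho
        dist[start] = cost[start][0]
        back = []  # back[k - 1][r]: predecessor radius of r at angle k
        for k in range(1, n_theta):
            new, prev = [inf] * n_rho, [0] * n_rho
            for r in range(n_rho):
                lo, hi = max(0, r - max_step), min(n_rho, r + max_step + 1)
                for q in range(lo, hi):
                    if dist[q] + cost[r][k] < new[r]:
                        new[r], prev[r] = dist[q] + cost[r][k], q
            dist = new
            back.append(prev)
        # Close the loop: the last radius must be adjacent to the first.
        lo, hi = max(0, start - max_step), min(n_rho, start + max_step + 1)
        for end in range(lo, hi):
            if dist[end] < best[0]:
                path = [end]
                for prev in reversed(back):
                    path.append(prev[path[-1]])
                best = (dist[end], path[::-1])
    return best
```

Trying every starting radius keeps the sketch simple at the price of extra computation. The method in [1] additionally processes slices in the $t$ direction and reconciles both families of geodesics, which this sketch does not attempt.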


5. EXAMPLES ON REAL SEQUENCES

We present segmentation results for two different movies of real scenes
with cars. Figures 3 and 4 show results for a 280-frame movie segmented
from only three points. Observe how shape changes are handled correctly
and how the segmentation is not biased by the presence of occlusions.
The passing truck takes 30 frames to pass, which means that useful
boundary information is completely missing in those frames. Figure 5
shows the recovered surface in 3D. Figure 6 shows the segmentation of a
passing car occluded by trees. This sequence has a very complex
background including other moving objects (the truck in the back), very
weak edges and some small camera movements. Not only does the shape of
the car change across frames (because of the perspective), but its area
also grows dramatically as it approaches the camera (from 347 to 2428
pixels), illustrating again the capacity of the algorithm to adapt to
arbitrary shapes. The results were obtained from four points selected in
the first and last frames and in frames 20 and 50. Additional points
were needed in this case because the speed of the car changes
significantly and its position cannot be linearly predicted from the
first and last frames.

The running time of the algorithm is about a minute per hundred frames
on a 1.2 GHz laptop computer without careful optimization. Movie
sequences are available for viewing at:
http://mountains.ece.umn.edu/~abarte/sequences.

Fig. 3. Sequence of a car turning at a stop sign. From left to right,
top to bottom: frames 1, 17, 117, 145, 150, 158, 162, 167, 177, 181, 250
and 280.


6.  SUMMARY  AND  DISCUSSION 

We presented an energy-based video segmentation algorithm that finds
object boundaries as minimal surfaces in the 3D space-time domain.
Geometric constraints are enforced by specifying points inside the
object to be segmented. The technique is both robust and accurate and
can track objects that change shape and are subject to severe and total
occlusions.


Fig. 4. Zoomed-in view of the turning car sequence. The three selected
points are shown in red. From left to right, top to bottom: frames 1,
17, 117, 145, 150, 158, 162, 167, 177, 181, 250 and 280.